The workstation used for the smaji cjkv project was built from waste.

The cjkv glyph collection contains more than ten thousand samples. Processing that many images requires a CPU, memory and disks with high throughput.

And since this is a long-term project intended to last for decades, data safety is one of the biggest concerns.

Computing

IMHO, the most cost-effective high-throughput CPUs on the market are those from retired servers. Because of the internet boom, servers are retired and replaced frequently, which produces a lot of waste. The amount of this 'waste' is so enormous that people have to pay to dispose of it. On the bright side, building a workstation from it costs far less.

We have two choices: Intel Xeon and AMD EPYC. For example, a second-hand Xeon E5-2696 v3 (PassMark rating 23343) costs only 320 CNY (45 USD), while an EPYC 7251 (PassMark rating 14935) costs 440 CNY (64 USD). Intel Xeon is a much better choice unless you already have an AMD motherboard.
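To make the comparison concrete, here is a small sketch that divides each PassMark rating by its second-hand price. It uses only the figures quoted above; the "points per CNY" metric is my own shorthand for cost-effectiveness, not an official benchmark.

```python
# PassMark ratings and second-hand prices (CNY) quoted in the text above.
cpus = {
    "Xeon E5-2696 v3": {"passmark": 23343, "price_cny": 320},
    "EPYC 7251": {"passmark": 14935, "price_cny": 440},
}


def points_per_cny(passmark, price_cny):
    """PassMark points bought per CNY spent (an ad-hoc value metric)."""
    return passmark / price_cny


value = {name: points_per_cny(**spec) for name, spec in cpus.items()}

# Print the best value first.
for name, v in sorted(value.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {v:.1f} PassMark points per CNY")
```

With these numbers the Xeon gives roughly twice as many points per CNY as the EPYC.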

Server CPUs also benefit memory and PCIe peripherals. The Xeon E5-2696 v3 has 40 PCIe lanes, so we can set up dual graphics cards (PCIe x16 x 2) without a performance penalty. For comparison, a Ryzen 5600 CPU has only 24 PCIe lanes: the first graphics card takes up 16 lanes, so the second one can only run on the remaining 8 lanes, i.e. half of 16, which incurs a performance penalty.
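The lane arithmetic can be sketched as below. This is a deliberately simplified model: it hands out lanes greedily, one card after another, and ignores lanes reserved for the chipset or NVMe as well as the actual slot wiring of a given motherboard. The function name is hypothetical.

```python
def allocate_lanes(cpu_lanes, cards, wanted=16):
    """Give each card the widest standard link (x16/x8/x4/x1) that still fits.

    Simplified greedy model; real boards fix the split in hardware.
    """
    widths = []
    remaining = cpu_lanes
    for _ in range(cards):
        limit = min(wanted, remaining)
        # Pick the widest standard link width not exceeding the limit, else 0.
        width = next((w for w in (16, 8, 4, 1) if w <= limit), 0)
        widths.append(width)
        remaining -= width
    return widths


print(allocate_lanes(40, 2))  # Xeon E5-2696 v3: [16, 16]
print(allocate_lanes(24, 2))  # Ryzen 5600:      [16, 8]
```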

Retired server memory is cheap as well. For example, we can get four 16 GiB ECC DDR4 modules to build a quad-channel memory system, and they cost only 400 CNY (58 USD).

Storage

Data safety! Its critical importance is beyond doubt. But this doesn't mean we have to build our storage pool from expensive top-tier disks and replace them regularly. As disk capacity has grown, the biggest risk is now the URE (Unrecoverable Read Error) rather than outright disk failure. A RAID5 array can withstand one disk failure, but once a disk has failed, the rebuild onto the replacement has no redundancy left and is not URE-tolerant. Hit a URE, and bang!! the rebuilding RAID5 array is screwed up.
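A back-of-the-envelope estimate shows why capacity matters here. The sketch below assumes a URE rate of 1 per 10^14 bits read (a figure often quoted on consumer disk spec sheets, used here purely as an assumption), independent errors, and a hypothetical RAID5 array of four 8 TB disks, so a rebuild has to read the three surviving disks in full. Real-world rates are often better than the spec-sheet bound, so treat the result as pessimistic.

```python
import math


def p_ure_during_rebuild(surviving_disks, disk_tb, ure_per_bit=1e-14):
    """Probability of at least one URE while reading every surviving disk.

    P = 1 - (1 - r)^N, with r the per-bit URE rate and N the bits read.
    Assumes independent errors and decimal terabytes (1 TB = 1e12 bytes).
    """
    bits_read = surviving_disks * disk_tb * 1e12 * 8
    # expm1/log1p keep precision when r is tiny.
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))


# Hypothetical array: 4 x 8 TB in RAID5, one disk failed, 3 left to read.
p = p_ure_during_rebuild(surviving_disks=3, disk_tb=8)
print(f"Chance of hitting a URE during the rebuild: {p:.0%}")
```

Under these assumptions the rebuild is more likely to hit a URE than not, which is exactly the situation a single-parity array cannot recover from.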

Instead of buying a few expensive disks, we can buy a few more cheap ones and build a RAID6 array. ZFS raidz2 is another good solution. Both can withstand two disk failures. As a URE is very unlikely to occur at the same position on two disks of the array, rebuilding is fairly safe.

Next, in case a fire destroys the whole RAID array, we need an RDTS (Remote Disaster Tolerant System). We can simply build another RAID6/raidz2 array at a remote site as the RDTS and sync the two regularly. Done!

In fact, my storage system was also built from waste (retired server disks), with an SSD added to the raidz2 pool as log and cache device. Cheap, stable, robust and efficient.
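For reference, a pool of this shape can be described with a single `zpool create` command. The sketch below only builds and prints that command; it does not run it. The pool name, the device paths and the idea of using two partitions of one SSD for log and cache are all hypothetical placeholders, not my actual layout.

```python
import shlex


def zpool_create_cmd(pool, data_disks, log_dev=None, cache_dev=None):
    """Build the argv for creating a raidz2 pool with optional log/cache devices."""
    if len(data_disks) < 3:
        # Two parity disks plus at least one data disk.
        raise ValueError("raidz2 needs at least three disks")
    cmd = ["zpool", "create", pool, "raidz2", *data_disks]
    if log_dev:
        cmd += ["log", log_dev]
    if cache_dev:
        cmd += ["cache", cache_dev]
    return cmd


# Hypothetical devices: four retired server disks and two SSD partitions.
cmd = zpool_create_cmd(
    "tank",
    ["/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde"],
    log_dev="/dev/sdf1",
    cache_dev="/dev/sdf2",
)
print(shlex.join(cmd))  # printed for review only, never executed here
```

Printing the command instead of running it is deliberate: creating a pool wipes the listed devices, so the line should be checked by hand first.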

GPU and KDE

I'm a Debian and KDE user. Debian Stable is so stable that some of its components are a bit outdated. Because of CUDA, I need new GPUs and have to keep pace with new NVIDIA drivers. The recent NVIDIA driver 525 broke backward compatibility with the old KDE component KSysGuard, which can no longer monitor NVIDIA graphics cards. In addition, the old KSysGuard in Debian Stable (currently Debian 11) can't monitor multiple GPUs, so I wrote two patches to fix these issues.

The version of KDE Plasma in Debian 11 is 5.20, so we can clone the KSysGuard components from https://invent.kde.org/plasma/ksysguard.git/ and check out the Plasma/5.20 branch.
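The clone-and-checkout step can be sketched as a small helper. The repository URL and branch name are the ones given above; the destination directory name and the `dry_run` switch are my own assumptions. By default it only prints the git commands so nothing touches the network.

```python
import shlex
import subprocess

REPO = "https://invent.kde.org/plasma/ksysguard.git/"
BRANCH = "Plasma/5.20"


def checkout_commands(dest="ksysguard"):
    """Return the git commands that clone the repo and switch to the branch."""
    return [
        ["git", "clone", REPO, dest],
        ["git", "-C", dest, "checkout", BRANCH],
    ]


def fetch(dest="ksysguard", dry_run=True):
    """Print the commands, and execute them only when dry_run is False."""
    for cmd in checkout_commands(dest):
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


fetch()  # dry run: shows the two commands without executing them
```

Call `fetch(dry_run=False)` to actually clone into the chosen directory.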

The new nvidia driver update

Nvidia driver version 525 changed its nvidia-smi output, which KSysGuard is not aware of. Here is the patch.

Multiple GPUs

And the multi-gpu enhancement.
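As a standalone illustration of the idea behind both fixes (this is not the KSysGuard patch itself), the sketch below reads per-GPU values from nvidia-smi's CSV query mode and maps each value to the field name that was requested, so it does not depend on fixed column positions and handles any number of GPUs. The sample text is made-up data in that format, with placeholder GPU names.

```python
import csv
import io

# Fields requested from nvidia-smi; the parser relies only on this order.
FIELDS = ["index", "name", "utilization.gpu", "memory.used", "memory.total"]


def query_cmd():
    """Command line that asks nvidia-smi for exactly FIELDS, one GPU per line."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]


def parse_gpus(text):
    """Turn the CSV output into one dict per GPU, keyed by field name."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, (cell.strip() for cell in row))) for row in rows if row]


# Made-up sample output for two GPUs (placeholder names and numbers).
SAMPLE = "0, GPU model A, 12, 1024, 24576\n1, GPU model B, 0, 3, 8192\n"

gpus = parse_gpus(SAMPLE)
for gpu in gpus:
    print(f"GPU {gpu['index']} ({gpu['name']}): "
          f"{gpu['utilization.gpu']}% busy, "
          f"{gpu['memory.used']}/{gpu['memory.total']} MiB")
```

On a real machine the sample string would be replaced by the output of `subprocess.run(query_cmd(), capture_output=True, text=True).stdout`.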

- ZAN DoYe



© 2024 ZAN DoYe